Load all required packages for this report and define some global customization settings.
# for knitting the .rmd file to .html
if (! requireNamespace("knitr", quietly = TRUE)) {
install.packages("knitr")
}
if (! requireNamespace("kableExtra", quietly = TRUE)) {
install.packages("kableExtra")
}
# for creating models and analyzing expression data
if (! requireNamespace("edgeR", quietly = TRUE)) {
BiocManager::install("edgeR")
}
if (! requireNamespace("limma", quietly = TRUE)) {
BiocManager::install("limma")
}
if (! requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
if (! requireNamespace("gprofiler2", quietly = TRUE)) {
install.packages("gprofiler2")
}
# for creating plots
if (! requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if (! requireNamespace("VennDiagram", quietly = TRUE)) {
install.packages("VennDiagram")
}
if (! requireNamespace("ComplexHeatmap", quietly = TRUE)) {
BiocManager::install("ComplexHeatmap")
}
if (! requireNamespace("circlize", quietly = TRUE)) {
BiocManager::install("circlize")
}
if (! requireNamespace("fgsea", quietly = TRUE)) {
BiocManager::install("fgsea")
}
if (! requireNamespace("qusage", quietly = TRUE)) {
BiocManager::install("qusage")
}
if (! requireNamespace("clusterProfiler", quietly = TRUE)) {
BiocManager::install("clusterProfiler")
}
if (! requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
BiocManager::install("org.Hs.eg.db")
}
library(clusterProfiler)
library(org.Hs.eg.db)
library(edgeR)
library(dplyr)
library(ggplot2)
library(fgsea)
library(qusage)
library(VennDiagram)
# wraps lines that are too long
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=80), tidy=TRUE)
# set default behaviors for all chunks
knitr::opts_chunk$set(warning = FALSE, message = FALSE)All table and figure outputs in this report are rendered using the knitr package [1], and kableExtra package [2] for styling. Some codes in this report are adapted from Lecture 10 - GSEA.
Glucocorticoids (GC) is a class of steroid hormones that plays a role in regulating the immune system and certain aspects of the immune function, such as inflammation. The most common naturally-produced GC hormone in humans is cortisol, which are produced by the adrenal glands. Due to GC’s potent anti-inflammatory effects, several synthetic forms of GC are used as immunosuppressive drugs to treat various medical conditions such as asthma, allergies, autoimmune disorders, and cancer. [3]
The data set used in this report comes from Quatrini et al. (2022)’s paper: Glucocorticoids inhibit human hematopoietic stem cell differentiation toward a common ILC precursor [3], where they analyzed the the role of GC on regulating innate lymphoid cells (ILCs) differentiation, including cytotoxic natural killer cells and helper ILCs, from human hematopoietic stem cells (HSCs). The RNA-seq data set used in this report is a part of the overall study; it evaluates the effect of the presence of Dexamethasone (DEX), a glucocorticoid medication, on gene expressions of HSCs. The raw data set is downloaded from the NCBI Gene Expression Omnibus, with ID GSE186950. [4]
In Assignment 1, I performed initial pre-processing and normalization of the data set. The raw data set contains 2 groups, control and conditioned (DEX), with 3 samples in each group: AF25, AF26, AF29. The data set initially contained gene expressions of 58683 genes, but 46243 genes were removed due to low counts with less than 1 count per million in at least 3 samples, leaving 12440 genes.
In the original data set, genes are labeled in a mix of HUGO gene symbols, GenBank accession IDs, EMBL identifiers, etc. Using the biomaRt package [5], we attempted to map non-HUGO identifiers to their corresponding HUGO gene symbols, but had to discard 847 (6.8%) genes with no matches, leaving a final set of 11593 unique genes.
knitr::include_graphics("figures/A2/raw_fltr_data_cpm.png")Figure 1. log2 CPM Distributions of gene expressions of each sample in the data set, after filtering out genes with low-counts or without HUGO gene symbol match, and before TMM normalization. This figure is adapted from Assignment 1 with slight aesthetic modifications.
As shown in Fig.1, The filtered data set is approximately normally distributed with no outlier samples, and therefore we normalized the data using the Trimmed Mean of M-values (TMM) method, via the edgeR package. [6]
In Assignment 2, using the normalized data set, we built two models to identify the set of significantly differentially expressed genes. Then, we selected one of the models (a treatment-only model built using the exact test of the edgeR package) and performed thresholded over-representation analysis on genes that are significantly differentially expressed, based on a threshold of FDR-correct p-value < 0.05.
In this assignment, we’ll start with the result from the
treatment-only exact test model and load it as an .rds file. The
top_et.rds file can be generated by running
saveRDS(top_et, 'top_et.rds') after running all code chunks
in A2_JunNi_Du.Rmd.
top_et <- readRDS("top_et.rds")
et_output <- top_et$table
# calculate rank for each gene
et_output[, "rank"] <- -log(et_output$PValue, base = 10) * sign(et_output$logFC)
et_output <- et_output[order(et_output$rank), ]rnk <- data.frame(rownames(et_output), et_output$rank)
colnames(rnk) <- c("GeneName", "rank")
write.table(rnk, file = "./dex_ctrl.rnk", sep = "\t", col.name = TRUE, row.names = FALSE,
quote = FALSE)gsea_upreg <- read.table(file = "gsea_report_for_na_pos_1680460999801.tsv", sep = "\t",
header = TRUE, fill = TRUE)
gsea_downreg <- read.table(file = "gsea_report_for_na_neg_1680460999801.tsv", sep = "\t",
header = TRUE, fill = TRUE)1. What method did you use? What genesets did you use? Make sure to specify versions and cite your methods.
2. Summarize your enrichment results.
The GSEA pre-ranked identified 2590 up-regulated gene sets out of 5263 total gene sets, with 327 gene sets having nominal p-value < 0.05 and 1 gene set having FDR q-value < 0.25, which indicates significant enrichment.
For down-regulated gene sets, 2673 down-regulated gene sets out of 5263 total gene sets were identified, with 345 gene sets having nominal p-value < 0.05 and 62 gene set having FDR q-value < 0.25, which indicates significant enrichment.
pathway_names <- sapply(stringr::str_split(gsea_upreg$NAME, "%"), "[", 1:2)
pathway_names <- t(pathway_names)
gsea_upreg <- cbind(pathway_names, gsea_upreg)
upreg <- gsea_upreg[, c("1", "2", "SIZE", "NES", "NOM.p.val", "FDR.q.val")]
colnames(upreg) <- c("Term", "Data Source", "Size", "Normalized Enrichment Score",
"Nominal p-value", "FDR q-value")
kableExtra::kable_paper(knitr::kable(head(upreg[which(upreg$"Nominal p-value" < 0.05),
]), format = "html", digits = 10))| Term | Data Source | Size | Normalized Enrichment Score | Nominal p-value | FDR q-value |
|---|---|---|---|---|---|
| NEUTROPHIL DEGRANULATION | REACTOME DATABASE ID RELEASE 83 | 418 | 2.066326 | 0 | 0.3181236 |
| OXIDATIVE PHOSPHORYLATION | GOBP | 93 | 2.066148 | 0 | 0.1602628 |
| RESPIRATORY ELECTRON TRANSPORT, ATP SYNTHESIS BY CHEMIOSMOTIC COUPLING, AND HEAT PRODUCTION BY UNCOUPLING PROTEINS. | REACTOME | 117 | 2.007220 | 0 | 0.4715452 |
| RESPIRATORY ELECTRON TRANSPORT | REACTOME DATABASE ID RELEASE 83 | 97 | 1.996420 | 0 | 0.4438916 |
| HALLMARK_OXIDATIVE_PHOSPHORYLATION | MSIGDBHALLMARK | 149 | 1.993352 | 0 | 0.3759924 |
| THE CITRIC ACID (TCA) CYCLE AND RESPIRATORY ELECTRON TRANSPORT | REACTOME | 161 | 1.968398 | 0 | 0.4876496 |
pathway_names <- sapply(stringr::str_split(gsea_downreg$NAME, "%"), "[", 1:2)
pathway_names <- t(pathway_names)
gsea_downreg <- cbind(pathway_names, gsea_downreg)
downreg <- gsea_downreg[, c("1", "2", "SIZE", "NES", "NOM.p.val", "FDR.q.val")]
colnames(downreg) <- c("Term", "Data Source", "Size", "Normalized Enrichment Score",
"Nominal p-value", "FDR q-value")
kableExtra::kable_paper(knitr::kable(head(downreg[which(downreg$"Nominal p-value" <
0.05), ]), format = "html", digits = 10))| Term | Data Source | Size | Normalized Enrichment Score | Nominal p-value | FDR q-value |
|---|---|---|---|---|---|
| HALLMARK_INTERFERON_ALPHA_RESPONSE | MSIGDBHALLMARK | 71 | -2.064144 | 0.000000000 | 0.000000000 |
| T CELL ACTIVATION | GOBP | 121 | -1.970077 | 0.000000000 | 0.009179046 |
| INTERFERON ALPHA BETA SIGNALING | REACTOME | 50 | -1.921399 | 0.001811594 | 0.033516850 |
| HALLMARK_INTERFERON_GAMMA_RESPONSE | MSIGDBHALLMARK | 144 | -1.895172 | 0.000000000 | 0.057725385 |
| POSITIVE REGULATION OF SMALL GTPASE MEDIATED SIGNAL TRANSDUCTION | GOBP | 39 | -1.847791 | 0.000000000 | 0.175786400 |
| LYMPHOCYTE ACTIVATION | GOBP | 201 | -1.846445 | 0.000000000 | 0.150864050 |
3. How do these results compare to the results from the thresholded analysis in Assignment #2. Compare qualitatively. Is this a straight forward comparison? Why or why not?
summary_A2 <- readRDS("summary_A2.rds")
upreg_comparison <- data.frame(summary_A2$`top_terms_up-reg`, tolower(upreg$Term[1:8]))
downreg_comparison <- data.frame(summary_A2$`top_terms_down-reg`, tolower(downreg$Term[1:8]))
colnames(upreg_comparison) <- c("A#2 ORA Top Terms (Up-Regulated)", "A#3 GSEA Top Terms (Up-Regulated)")
colnames(downreg_comparison) <- c("A#2 ORA Top Terms (Down-Regulated)", "A#3 GSEA Top Terms (Down-Regulated)")
kableExtra::kable_paper(knitr::kable(upreg_comparison, format = "html", digits = 10))| A#2 ORA Top Terms (Up-Regulated) | A#3 GSEA Top Terms (Up-Regulated) |
|---|---|
| inflammatory response | neutrophil degranulation |
| cell migration | oxidative phosphorylation |
| Metabolic pathways | respiratory electron transport, atp synthesis by chemiosmotic coupling, and heat production by uncoupling proteins. |
| Amino sugar and nucleotide sugar metabolism | respiratory electron transport |
| Neutrophil degranulation | hallmark_oxidative_phosphorylation |
| Innate Immune System | the citric acid (tca) cycle and respiratory electron transport |
| Nuclear receptors meta-pathway | respiratory electron transport chain |
| Glucocorticoid receptor pathway | atp synthesis coupled electron transport |
kableExtra::kable_paper(knitr::kable(downreg_comparison, format = "html", digits = 10))| A#2 ORA Top Terms (Down-Regulated) | A#3 GSEA Top Terms (Down-Regulated) |
|---|---|
| signaling | hallmark_interferon_alpha_response |
| cell communication | t cell activation |
| Hematopoietic cell lineage | interferon alpha beta signaling |
| Asthma | hallmark_interferon_gamma_response |
| Cytokine Signaling in Immune system | positive regulation of small gtpase mediated signal transduction |
| Signal Transduction | lymphocyte activation |
| Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex | antigen processing and presentation of exogenous antigen |
| T-cell receptor signaling pathway | antigen processing and presentation |
1. Create an enrichment map - how many nodes and how many edges in the resulting map? What thresholds were used to create this map? Make sure to record all thresholds. Include a screenshot of your network prior to manual layout.
The enrichment map was created using the following thresholds:
The enrichment map has 63 nodes and 71 edges.
knitr::include_graphics("figures/A3/GseaPreranked.png")2. Annotate your network - what parameters did you use to annotate the network. If you are using the default parameters make sure to list them as well.
Using the AutoAnnotate App available in Cytoscape with the following parameters:
3. Make a publication ready figure - include this figure with proper legends in your notebook.
The enrichment map below was created using the parameters above, arranged used GoSE layout and spatially-rearranged to be more compact.
knitr::include_graphics("figures/A3/EM_grouped.png")4. Collapse your network to a theme network. What are the major themes present in this analysis? Do they fit with the model? Are there any novel pathways or themes?
knitr::include_graphics("figures/A3/theme.png")Per the above figure, some of the major themes include immune system (e.g.lymphocyte mediated immunity, antigen receptor-mediated signaling pathway, regulation of interferon-beta production, defense response to virus, etc.) and cell signaling (e.g. signal transduction protein, nervous system axonogenesis, activation involved response). All the gene sets in these themes are down-regulated, which fits with our models where the presence of dexamethasone downregulates immune system responses and cell signaling, which is the mechanism for its anti-inflammatory property.
Two novel themes that are shown in the annotated clusters and the theme network, but not in the top hits of the models, are ‘p53 class mediator’ and ‘sequestering calcium ion’. Additional discussion regarding these findings can be found in question 2 of Interpretation and Detailed View of Results.
Background and pre-processing of the data set can be found here:
Questions from previous sections can be found here:
1. Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods
2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?
Two novel themes that are shown in the annotated clusters and the theme network, but not in the top hits of the models, are ‘p53 class mediator’ and ‘sequestering calcium ion’. p53 is a tumor suppressor protein that regulates cell division. In a paper by Sengupta et al. (2000), it is said that dexamethasone-activated endogenous and exogenous glucocorticoid receptor inhibit p53-dependent functions@p53, so it makes sense to see this theme down-regulated in our network.
Using your networks and results from the previous section add one of the following:
I chose to answer question 3, regarding dark matter.
3. Sometimes the most interesting information is the gene that has no information. In this type of pathway analysis we can only discover what we have already described previously in the literature or pathway databases. Often pathways found in one disease are applicable to other diseases so this technique can be very helpful. It is important to highlight any genes that are significantly differentially expressed in your model but are not annotated to any pathways. We refer to this set of genes as the dark matter.
# extract out the sets of genes required for the Venn diagram.
gene_sets <- fgsea::gmtPathways("Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt")
Annotated <- unique(unlist(gene_sets))
Enriched <- unique(unlist(gene_sets[c(gsea_upreg$NAME, gsea_downreg$NAME)]))
genes_expressed <- readRDS("raw_fltr_matrix.rds")
Expressed <- unique(rownames(genes_expressed))venn_diag <- draw.triple.venn(area1 = length(Annotated), area2 = length(Enriched),
area3 = length(Expressed), n12 = length(intersect(Annotated, Enriched)), n13 = length(intersect(Annotated,
Expressed)), n23 = length(intersect(Enriched, Expressed)), n123 = length(intersect(Annotated,
intersect(Enriched, Expressed))), category = c("Annotated", "Enriched", "Expressed"),
fill = hcl.colors(3, palette = "Sunset"), fontfamily = "Arial", cat.fontfamily = "Arial",
cat.cex = 1.2, )
grid::grid.draw(venn_diag)Per the Venn diagram, 9479 out of 11593 filtered expressed genes (81.76%) were annotated and enriched. This is ideal as it is an indication that the outputs of our enrichment analyses is a good representation of the expression data. Here, we’ll focus on the 2114 genes that were not enriched, which contains 1620 expressed genes that are not enriched nor annotated, and 494 genes that are annotated but not enriched.
# set of genes that are annotated but not enriched
anno_no_enrich <- setdiff(intersect(Expressed, Annotated), Enriched)
anno_no_enrich <- et_output[anno_no_enrich, ]
# filter out genes that are not significantly differentially expressed
anno_no_enrich <- anno_no_enrich[which(anno_no_enrich$FDR < 0.05), ]
# set of genes with no annotation
no_anno <- setdiff(Expressed, Annotated)
no_anno <- et_output[no_anno, ]
# filter out genes that are not significantly differentially expressed
no_anno <- no_anno[which(no_anno$FDR < 0.05), ]rmarkdown::paged_table(anno_no_enrich[order(anno_no_enrich$rank, decreasing = TRUE),
c("FDR", "rank")])rmarkdown::paged_table(no_anno[order(no_anno$rank, decreasing = TRUE), c("FDR", "rank")])# creating normalized CPM for heatmap
dge <- edgeR::DGEList(counts = genes_expressed, group = c("CTRL", "CTRL", "CTRL",
"DEX", "DEX", "DEX"))
# calculate normalization factor
dge <- calcNormFactors(dge)
# get the normalized data
normalized_cpm <- cpm(dge)3.i. Include a heatmap of any significant genes that are not annotated to any of the pathways returned in the enrichment analysis.
# creating heatmap matrix
heatmap_matrix_1 <- t(scale(t(normalized_cpm[rownames(anno_no_enrich), ])))
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_1), 0, max(heatmap_matrix_1)),
c("blue", "white", "red"))
# creating treament annotation
treat_colours <- c("orange", "darkblue")
names(treat_colours) <- c("DEX", "CTRL")
treatment_col <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(treatment = rep(c("CTRL",
"DEX"), c(3, 3))), col = list(treatment = treat_colours))
# Create the heatmap
heatmap_anno_no_enrich <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_1), cluster_rows = TRUE,
cluster_columns = TRUE, show_row_dend = TRUE, show_column_dend = TRUE, col = heatmap_col,
show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE,
top_annotation = treatment_col, column_title = "Differential Gene Expressions of Genes That Are Annotated but Not Enriched",
name = "row-normalized \ngene expression")
heatmap_anno_no_enrichAs we can see from the heatmap, there are some distinction of expression pattern between the control and conditioned group. This distinction is not definite (e.g. on the bottom left, we can see a few areas where genes are up-regulated in two of the three samples but down-regulated in the other). Although the distinction does exist, this heat map only contains 28 genes (significantly differentially expressed, annotated, but not enriched), and is unlikely to contain any major significant pathways that we did not already identify.
3.ii. Include a heatmap of any significant genes that are not annotated to any pathways in entire set of pathways used for the analysis.
# creating heatmap matrix
heatmap_matrix_2 <- t(scale(t(normalized_cpm[rownames(no_anno), ])))
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_2), 0, max(heatmap_matrix_2)),
c("blue", "white", "red"))
# creating treatment annotation
treat_colours <- c("orange", "darkblue")
names(treat_colours) <- c("DEX", "CTRL")
treatment_col <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(treatment = rep(c("CTRL",
"DEX"), c(3, 3))), col = list(treatment = treat_colours))
# Create the heatmap
heatmap_no_anno <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_2), cluster_rows = TRUE,
cluster_columns = TRUE, show_row_dend = TRUE, show_column_dend = TRUE, col = heatmap_col,
show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE,
top_annotation = treatment_col, column_title = "Differential Gene Expressions of Genes That Are Not Annotated",
name = "row-normalized \ngene expression")
heatmap_no_anno
Similar to the previous heatmap (annotated but not enriched), this
heatmap also clustered some distinction in gene expression patterns
between the two groups, but each cluster is not very homogeneous. For
example, there exists bands of blue (downregulated) genes in the red
area (cluster of up-regulated genes).Although the distinction does
exist, this heat map only contains 142 genes (significantly
differentially expressed but not annotated). A solution to these dark
matter genes is to use a larger set of pathways that could potentially
include these genes, but the downside to this is that it could also
introduce less-relevant pathways in our network result.